Add multimodal embedding & rerank support #66
Conversation
---

It's best to create a multimodal `Embedding` class in `llama_embedding.py`, or to extend the existing `Embedding` class to manage `mctx`. There's no need to add unnecessary memory usage to `Llama`. Remember to release the memory of the new `mctx` after use.
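The release-after-use requirement above could be enforced with a context manager so `mctx` is freed even when embedding raises. A minimal sketch; the `mtmd_init_from_file`/`mtmd_free` names below are hypothetical stand-ins for the real mtmd bindings, not the actual `mtmd_cpp` API:

```python
# Sketch of scoped ownership for the multimodal context (mctx).
# The mtmd calls below are hypothetical placeholders, not the real mtmd_cpp API.
from contextlib import contextmanager


class _MtmdContext:
    """Stand-in for the native handle returned by the mtmd bindings."""

    def __init__(self, mmproj_path: str):
        self.mmproj_path = mmproj_path
        self.freed = False


def mtmd_init_from_file(mmproj_path: str) -> _MtmdContext:  # hypothetical
    return _MtmdContext(mmproj_path)


def mtmd_free(mctx: _MtmdContext) -> None:  # hypothetical
    mctx.freed = True


@contextmanager
def multimodal_context(mmproj_path: str):
    """Yield an mctx and guarantee it is released, even if the body raises."""
    mctx = mtmd_init_from_file(mmproj_path)
    try:
        yield mctx
    finally:
        mtmd_free(mctx)
```

An `Embedding` class could instead hold such a context for its own lifetime and free it in its cleanup path, the same way `Llama` releases its native resources.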
---
Actually I already am, by the way. Here is my usage:

```python
from typing import Any, Dict, List

# Example document: mixed text and image parts.
doc = [
    {"type": "text", "text": f"Name: {filepath.name}"},
    {"type": "image", "image": image_data},
]


class RAGModel:
    def __init__(self):
        self._model = LlamaEmbedding(
            # ...
            mmproj_path=...,
            image_min_tokens=...,
            image_max_tokens=...,
        )

    def _tmpl(self, contents: List[Dict[str, Any]], instruct: str):
        files = []
        image_id = 0
        # Should not manually concatenate the chat template here...
        tmpl = f"<|im_start|>system\n{instruct}<|im_end|>\n<|im_start|>user\n"
        for item in contents:
            kind = item["type"]
            if kind == "text":
                tmpl += item["text"]
            elif kind == "image":
                image_id += 1
                files.append(item["image"])
                tmpl += f"Picture {image_id}: <__media__>"  # <__media__> is the placeholder in mtmd
        return tmpl + "<|im_end|>\n<|im_start|>assistant\n", files

    def embed_document(
        self,
        contents: List[Dict[str, Any]],
        instruction: str = "Represent the user's input.",
        return_count: bool = False,
    ) -> List[float]:
        text, files = self._tmpl(contents, instruction)
        return self._model.embed_multimodal(text, files, return_count=return_count)
```
---
Currently there is indeed no multimodal class, analogous to `Llama` or the sampler classes, that abstracts the `mtmd_cpp` API. The heavyweight, complex `llama_chat_format` implementations based on llava-1.5 are indeed difficult to maintain.
(cherry picked from commit 4ba212f)
---
In `__init__()`:

```python
def __init__(self):
    eos_token_id = self._model.token_eos()
    bos_token_id = self._model.token_bos()
    eos_token = (
        self._model._model.token_get_text(eos_token_id) if eos_token_id != -1 else ""
    )
    bos_token = (
        self._model._model.token_get_text(bos_token_id) if bos_token_id != -1 else ""
    )
    self._formatter = Jinja2MultimodalChatFormatter(
        template=self._model.metadata["tokenizer.chat_template"],
        eos_token=eos_token,
        bos_token=bos_token,
        stop_token_ids=[eos_token_id],
    )


def _tmpl(self, contents: List[Dict[str, Any]], instruct: str):
    result = self._formatter([
        {"role": "system", "content": instruct},
        {"role": "user", "content": contents},
    ])
    return result.prompt, result.medias
```

Contents can be image or audio; local disk paths, network URLs, and `bytes`/`bytearray` instances are supported, but there is no video support yet. I also think `create_completion` is too complex, so I will add an alternate function instead (to avoid a breaking change).
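The core of what such a formatter does, independent of Jinja2, is flattening chat messages into a prompt while replacing each media part with the mtmd `<__media__>` marker and collecting the payloads in order. A minimal sketch under that assumption; the function and result-type names are illustrative, and the hard-coded ChatML wrapper stands in for rendering the model's actual chat template:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FormatterResult:
    prompt: str
    medias: List[Any] = field(default_factory=list)


def render_multimodal(messages: List[Dict[str, Any]]) -> FormatterResult:
    """Flatten chat messages into a prompt string, substituting the mtmd
    `<__media__>` marker for each image/audio part and collecting the media
    payloads in document order. A real implementation would render the
    model's Jinja2 chat template instead of this fixed ChatML wrapper."""
    prompt = ""
    medias: List[Any] = []
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n"
        content = msg["content"]
        if isinstance(content, str):
            prompt += content
        else:
            for part in content:
                if part["type"] == "text":
                    prompt += part["text"]
                else:  # image or audio part
                    medias.append(part[part["type"]])
                    prompt += "<__media__>"
        prompt += "<|im_end|>\n"
    return FormatterResult(prompt, medias)
```

Keeping the media list aligned with the order of `<__media__>` markers is what lets the caller hand both straight to the mtmd tokenizer.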
---
Hi @roj234, this PR can keep being adapted and optimized. I first need to refactor the batch decode and eval parts: the old execution logic has alignment issues that cause the KV cache to go out of sync after a new model's first round. On top of the changes from ggml-org/llama.cpp@2b6dfe8, I will simply refactor following llama.cpp's current, newer approach. This will interfere somewhat with the Embedding part, but it should be worth it.
---
OK. My planned change is: in addition to the new `LlamaEmbedding.embed_multimodal` function, also create a function similar to `Llama.create_multimodal_chat_completion` that can directly handle image/audio objects in the request, or video objects in the future (I looked at Qwen VL's code; its video implementation uses ffmpeg to slice the video into an image sequence at n FPS, though new approaches may appear later, and then it will depend on how the mtmd library implements it).
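The n-FPS slicing described for Qwen VL amounts to choosing evenly spaced timestamps and extracting one frame at each. A minimal sketch of just the sampling step; the function name and the ffmpeg command mentioned in the comment are illustrative, not taken from Qwen VL's code:

```python
from typing import List


def sample_timestamps(duration_s: float, fps: float) -> List[float]:
    """Return the timestamps (in seconds) at which to extract frames so a
    video of length `duration_s` becomes an image sequence at `fps` frames
    per second. Each timestamp could then be fed to a frame grab such as
    `ffmpeg -ss <t> -i video.mp4 -frames:v 1 out.png`."""
    if duration_s <= 0 or fps <= 0:
        return []
    step = 1.0 / fps
    n = int(duration_s * fps)  # number of frames in the sequence
    return [round(i * step, 6) for i in range(n)]
```

Precomputing timestamps keeps the sampling policy separate from the extraction backend, so ffmpeg could later be swapped for whatever mechanism the mtmd library adopts.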
It works, but it duplicates what `llama_chat_format` already implements for multimodal; however, that path does not support embedding models like Qwen-VL-Embedding.
This code heavily references llama-server's C++ code (`ServerTokens`).